Documentation Index
Fetch the complete documentation index at: https://docs.platform.qubrid.com/llms.txt
Use this file to discover all available pages before exploring further.
NVIDIA · Chat / LLM · 120B Parameters (12B Active) · 256K Context (up to 1M)

Function Calling Streaming Reasoning Agent Workflows Long Context Code Tool Use
Overview
NVIDIA Nemotron-3 Super 120B A12B FP8 is an open-weight LLM built for agentic reasoning and high-volume enterprise workloads. Using a hybrid LatentMoE architecture (Mamba-2 + MoE + Attention) with Multi-Token Prediction (MTP) and native NVFP4 pretraining on 25T tokens, it delivers up to 2.2x higher throughput than GPT-OSS-120B and 7.5x higher than Qwen3.5-122B. With a native 1M-token context window, configurable thinking mode, and 60.47% on SWE-Bench Verified, it is purpose-built for collaborative agents, long-context reasoning, and IT automation across 7 languages — served instantly via the Qubrid AI Serverless API.
⚡ 2.2x throughput vs GPT-OSS-120B. 1M token context. 512 experts, 22 active per token.
Deploy on Qubrid AI — no H100 cluster required.
Model Specifications
| Field | Details |
|---|---|
| Model ID | nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 |
| Provider | NVIDIA |
| Kind | Chat / LLM |
| Architecture | LatentMoE — Mamba-2 + MoE + Attention hybrid with MTP; 512 experts, 22 active per token; 120B total / 12B active |
| Parameters | 120B total (12B active per inference pass) |
| Context Length | 256K Tokens (up to 1M) |
| MoE | Yes — 512 experts, 22 active per token |
| Release Date | March 11, 2026 |
| License | NVIDIA Nemotron Open Model License |
| Training Data | 25T token corpus (NVFP4 native pretraining): web, code, math, science, multilingual; post-training cutoff February 2026; pre-training cutoff June 2025 |
| Function Calling | Supported |
| Image Support | N/A |
| Serverless API | Available |
| Fine-tuning | Coming Soon |
| On-demand | Coming Soon |
| State | 🟢 Ready |
Pricing
💳 Access via the Qubrid AI Serverless API with pay-per-token pricing. No infrastructure management required.
| Token Type | Price per 1M Tokens |
|---|---|
| Input Tokens | $0.10 |
| Input Tokens (Cached) | $0.04 |
| Output Tokens | $0.50 |
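The pricing table translates into a simple per-request cost formula: uncached input, cached input, and output tokens are each billed at their own rate per million. A minimal sketch (the token counts in the example are illustrative, not measured):

```python
# Qubrid pricing for this model, USD per 1M tokens (from the table above)
PRICE_INPUT = 0.10
PRICE_INPUT_CACHED = 0.04
PRICE_OUTPUT = 0.50

def estimate_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate the USD cost of one request; cached_tokens must not exceed input_tokens."""
    fresh = input_tokens - cached_tokens
    cost = (fresh * PRICE_INPUT
            + cached_tokens * PRICE_INPUT_CACHED
            + output_tokens * PRICE_OUTPUT) / 1_000_000
    return round(cost, 6)

# Example: a 200K-token prompt with 150K tokens served from cache, 4K-token answer
print(estimate_cost(200_000, 4_000, cached_tokens=150_000))  # → 0.013
```

Cached input pricing matters most for long-context and RAG workloads, where the same large prefix is resent on every turn.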
Quickstart
Prerequisites
- Create a free account at platform.qubrid.com
- Generate your API key from the API Keys section
- Replace QUBRID_API_KEY in the code below with your actual key
💡 Temperature & Top P: Use temperature=1 and top_p=0.95 — recommended for all tasks with this model.
Python
from openai import OpenAI

# Initialize the OpenAI client with the Qubrid base URL
client = OpenAI(
    base_url="https://platform.qubrid.com/v1",
    api_key="QUBRID_API_KEY",
)

# Create a streaming chat completion
stream = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
    messages=[
        {
            "role": "user",
            "content": "Explain quantum computing in simple terms"
        }
    ],
    max_tokens=16000,
    temperature=1,
    top_p=0.95,
    stream=True
)

# Streaming (stream=True): print tokens as they arrive
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print("\n")

# Non-streaming (stream=False): use this instead of the loop above
# print(stream.choices[0].message.content)
JavaScript
import OpenAI from "openai";

// Initialize the OpenAI client with the Qubrid base URL
const client = new OpenAI({
  baseURL: "https://platform.qubrid.com/v1",
  apiKey: "QUBRID_API_KEY",
});

// Create a streaming chat completion
const stream = await client.chat.completions.create({
  model: "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
  messages: [
    {
      role: "user",
      content: "Explain quantum computing in simple terms",
    },
  ],
  max_tokens: 16000,
  temperature: 1,
  top_p: 0.95,
  stream: true,
});

// Streaming (stream: true): print tokens as they arrive
for await (const chunk of stream) {
  if (chunk.choices[0]?.delta?.content) {
    process.stdout.write(chunk.choices[0].delta.content);
  }
}
console.log("\n");

// Non-streaming (stream: false): use this instead of the loop above
// console.log(stream.choices[0].message.content);
Go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	url := "https://platform.qubrid.com/v1/chat/completions"

	data := map[string]interface{}{
		"model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
		"messages": []map[string]string{
			{
				"role":    "user",
				"content": "Explain quantum computing in simple terms",
			},
		},
		"temperature": 1,
		"max_tokens":  16000,
		"stream":      true,
		"top_p":       0.95,
	}

	jsonData, _ := json.Marshal(data)

	req, _ := http.NewRequest("POST", url, bytes.NewBuffer(jsonData))
	req.Header.Set("Authorization", "Bearer QUBRID_API_KEY")
	req.Header.Set("Content-Type", "application/json")

	client := &http.Client{}
	res, _ := client.Do(req)
	defer res.Body.Close()

	// Print the server-sent event lines as they arrive
	scanner := bufio.NewScanner(res.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if line != "" {
			fmt.Println(line)
		}
	}
}
cURL
curl -X POST "https://platform.qubrid.com/v1/chat/completions" \
  -H "Authorization: Bearer QUBRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
    "messages": [
      {
        "role": "user",
        "content": "Explain quantum computing in simple terms"
      }
    ],
    "temperature": 1,
    "max_tokens": 16000,
    "stream": true,
    "top_p": 0.95
  }'
Live Example
Prompt: What are the benefits of renewable energy?
Response:
Renewable energy offers a broad set of benefits spanning environmental,
economic, and strategic dimensions:
Environmental
• Zero direct emissions during operation — solar, wind, and hydro produce
no CO₂, SOₓ, or NOₓ at the point of generation
• Reduces dependence on fossil fuel extraction, lowering habitat disruption,
water contamination risk, and land use pressure
• Contributes directly to national and international decarbonization targets
(Paris Agreement, net-zero commitments)
Economic
• Levelized cost of energy (LCOE) for solar and onshore wind is now lower
than new coal or gas plants in most markets
• Creates long-term stable energy pricing — no fuel cost volatility
• Generates local employment in manufacturing, installation, and maintenance
• Attracts investment: clean energy is the fastest-growing segment of
infrastructure capital globally
Energy Security
• Reduces import dependency on oil, gas, and coal from geopolitically
unstable regions
• Distributed generation (rooftop solar, microgrids) improves grid
resilience against outages and cyberattacks
• Nations with abundant sun, wind, or hydro can achieve near-complete
domestic energy independence
Technological Spillovers
• Advances in battery storage, smart grids, and power electronics driven
by renewables deployment benefit the broader economy
• Supports electrification of transport, heating, and industrial processes
Tradeoffs to consider:
• Intermittency (solar/wind) requires storage or grid balancing solutions
• Large-scale installations have land and materials requirements
• Upfront capital costs remain higher for some technologies vs. fossil alternatives
Net assessment: for most economies, renewable energy now represents the most
cost-effective, low-risk long-term energy strategy available.
Try it yourself in the Qubrid AI Playground →
Playground Features
The Qubrid AI Playground lets you interact with Nemotron-3 Super 120B directly in your browser — no setup, no code, no cost to explore.
🧠 System Prompt
Define the model’s role, reasoning mode, and output constraints before the conversation begins. Particularly powerful for agentic pipelines, tool-use workflows, and structured enterprise tasks.
Example: "You are an enterprise IT automation agent. Analyze incoming support
tickets, classify them by severity and category, suggest resolution steps,
and escalate critical issues with a structured JSON summary."
Set your system prompt once in the Qubrid Playground and it applies across every turn of the conversation.
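Outside the Playground, the same behavior comes from a `system` message placed first in the `messages` array, using the OpenAI-compatible endpoint from the Quickstart. A minimal sketch (the ticket text is illustrative):

```python
SYSTEM_PROMPT = (
    "You are an enterprise IT automation agent. Analyze incoming support "
    "tickets, classify them by severity and category, suggest resolution steps, "
    "and escalate critical issues with a structured JSON summary."
)

def build_messages(user_ticket: str) -> list:
    """Prepend the system prompt so it applies to every turn, as in the Playground."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_ticket},
    ]

messages = build_messages('Ticket: "VPN drops every 30 minutes for the sales team."')
# Pass `messages` to client.chat.completions.create(...) exactly as in the Quickstart.
```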
🎯 Few-Shot Examples
Guide the model’s output structure and reasoning depth with concrete examples — no fine-tuning required. Especially effective for structured outputs and multi-step agentic tasks.
| User Input | Assistant Response |
|---|---|
| Ticket: "Server keeps crashing every 12 hours." Priority? | Priority: HIGH. Category: Infrastructure Stability. Suggested action: Check system logs for OOM events, review cron jobs scheduled near crash window, and verify disk I/O health. |
| Summarize this 50-page policy document in 5 bullet points | • Scope: Applies to all employees handling customer PII. • Key requirement: Data must be encrypted at rest and in transit. • Breach protocol: Notify DPO within 72 hours. • Retention: 7-year maximum. • Non-compliance: Subject to disciplinary review. |
💡 Stack multiple few-shot examples in the Qubrid Playground to shape agentic behavior, output schema, and reasoning verbosity — no fine-tuning required.
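In the API, few-shot rows like those above map onto alternating user/assistant turns inserted before the real request. A minimal sketch (the example pair is abbreviated from the table):

```python
# (question, ideal answer) pairs that demonstrate the desired output format
FEW_SHOT = [
    ('Ticket: "Server keeps crashing every 12 hours." Priority?',
     "Priority: HIGH. Category: Infrastructure Stability. "
     "Suggested action: check system logs for OOM events."),
]

def with_few_shot(user_input: str) -> list:
    """Interleave example user/assistant turns, then append the real request."""
    messages = []
    for question, answer in FEW_SHOT:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": user_input})
    return messages

msgs = with_few_shot('Ticket: "Printer offline on floor 3." Priority?')
print(len(msgs))  # one example pair + the new request → 3
```

Each added pair costs input tokens on every call, so with long-context workloads the cached input rate ($0.04/1M) keeps stacked examples cheap.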
Inference Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| Streaming | boolean | true | Enable streaming responses for real-time output |
| Temperature | number | 1 | Controls randomness in output. Recommended: 1.0 for all tasks |
| Max Tokens | number | 16000 | Maximum tokens to generate |
| Top P | number | 0.95 | Controls nucleus sampling. Recommended: 0.95 for all tasks |
Use Cases
- Agentic workflows and multi-agent collaboration
- Long-context reasoning (up to 1M tokens)
- IT ticket automation and high-volume enterprise workloads
- Complex tool use and multi-step function calling
- RAG (Retrieval-Augmented Generation)
- Software engineering and cybersecurity triaging
Strengths & Limitations
| Strengths | Limitations |
|---|---|
| LatentMoE: 512 experts / 22 active per token at same compute cost as standard MoE | Requires minimum 2× H100-80GB GPUs for local deployment |
| 2.2x throughput vs GPT-OSS-120B; 7.5x vs Qwen3.5-122B | Thinking mode adds latency overhead; low-effort mode recommended for simple queries |
| 60.47% SWE-Bench Verified; 83.73% MMLU-Pro; 79.23% GPQA | Not optimized for vision or multimodal inputs |
| Native 1M token context — 91.75% on RULER @ 1M | Function calling supported but may need prompt engineering for complex schemas |
| MTP speculative decoding: 3.45 avg acceptance length (up to 3x wall-clock speedup) | |
| Configurable reasoning mode via enable_thinking=True/False | |
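The table names `enable_thinking` as the reasoning-mode switch. With the OpenAI SDK, provider-specific fields like this are usually passed through `extra_body`; whether Qubrid accepts the flag under that name at this endpoint is an assumption to verify against the API reference. A sketch that assembles the request body:

```python
def build_request(prompt: str, thinking: bool) -> dict:
    """Assemble a chat-completions request body with the recommended sampling
    defaults. `enable_thinking` is the flag named in the table above; passing it
    as a top-level body field is an ASSUMPTION — confirm with Qubrid's API docs."""
    return {
        "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1,
        "top_p": 0.95,
        "max_tokens": 16000,
        "enable_thinking": thinking,  # hypothetical placement of the documented flag
    }

# Low-latency path for simple queries: disable thinking mode.
# With the OpenAI SDK the extra field would go through extra_body, e.g.:
# client.chat.completions.create(..., extra_body={"enable_thinking": False})
fast = build_request("What is 2+2?", thinking=False)
```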
Why Qubrid AI?
- 🚀 No infrastructure setup — 120B MoE served serverlessly, pay only for what you use
- 🔁 OpenAI-compatible — drop-in replacement using the same SDK, just swap the base URL
- 💰 Cached input pricing — $0.04/1M for cached tokens, critical for long-context and repeated RAG workloads
- ⚡ Throughput-optimized — Nemotron’s 2.2x speed advantage is fully realized on Qubrid’s low-latency infrastructure
- 🧪 Built-in Playground — prototype with system prompts and few-shot examples instantly at platform.qubrid.com
- 📊 Full observability — API logs and usage tracking built into the Qubrid dashboard
Resources
Built with ❤️ by Qubrid AI
Frontier models. Serverless infrastructure. Zero friction.